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Beyond Low Rank + Sparse: 
Multi-scale Low Rank Matrix Decomposition 

Frank Ong, and Michael Lustig 


Abstract — We present a natural generalization of the recent 
low rank + sparse matrix decomposition and consider the 
decomposition of matrices into components of multiple scales. 
Such decomposition is well motivated in practice as data matrices 
often exhibit local correlations in multiple scales. Concretely, 
we propose a multi-scale low rank modeling that represents 
a data matrix as a sum of block-wise low rank matrices 
with increasing scales of block sizes. We then consider the 
inverse problem of decomposing the data matrix into its multi¬ 
scale low rank components and approach the problem via a 
convex formulation. Theoretically, we show that under various 
incoherence conditions, the convex program recovers the multi¬ 
scale low rank components either exactly or approximately. 
Practically, we provide guidance on selecting the regularization 
parameters and incorporate cycle spinning to reduce blocking 
artifacts. Experimentally, we show that the multi-scale low rank 
decomposition provides a more intuitive decomposition than 
conventional low rank methods and demonstrate its effectiveness 
in four applications, including illumination normalization for face 
images, motion separation for surveillance videos, multi-scale 
modeling of the dynamic contrast enhanced magnetic resonance 
imaging and collaborative filtering exploiting age information. 

Index Terms —Multi-scale Modeling, Low Rank Modeling, 
Convex Relaxation, Structured Matrix, Signal Decomposition 

I. Introduction 

Signals and systems often exhibit different structures at 
different scales. Such multi-scale structure has inspired a wide 
variety of multi-scale signal transforms, such as wavelets [1], 
curvelets [2] and multi-scale pyramids |3|, that can represent 
natural signals compactly. Moreover, their ability to compress 
signal information into a few significant coefficients has made 
multi-scale signal transforms valuable beyond compression 
and are now commonly used in signal reconstruction appli¬ 
cations, including denoising 0- compressed sensing 0, 
and signal separation 0-0. By now, multi-scale modeling 
is associated with many success stories in engineering appli¬ 
cations. 

On the other hand, low rank methods are commonly used 
instead when the signal subspace needs to be estimated as 
well. In particular, low rank methods have seen great success 
in applications, such as biomedical imaging (TO) , face recogni¬ 
tion GU and collaborative filtering GT that require exploiting 
the global data correlation to recover the signal subspace 
and compactly represent the signal at the same time. Recent 
convex relaxation techniques GD have further enabled low 
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rank model to be adaptable to various signal processing tasks, 
including matrix completion GU system identification CD 
and phase retrieval (16) , making low rank methods ever more 
attractive. 

In this paper, we present a multi-scale low rank matrix 
decomposition method that incorporates multi-scale structures 
with low rank methods. The additional multi-scale structure 
allows us to obtain a more accurate and compact signal 
representation than conventional low rank methods whenever 
the signal exhibits multi-scale structures (see Figure [T]). To 
capture data correlation at multiple scales, we model our 
data matrix as a sum of block-wise low rank matrices with 
increasing scales of block sizes (more detail in Section E 
and consider the inverse problem of decomposing the matrix 
into its multi-scale components. Since we do not assume an 
explicit basis model, multi-scale low rank decomposition also 
prevents modeling errors or basis mismatch that are commonly 
seen with multi-scale signal transforms. In short, our proposed 
multi-scale low rank decomposition inherits the merits from 
both multi-scale modeling and low rank matrix decomposition. 

Leveraging recent convex relaxation techniques, we propose 
a convex formulation to perform the multi-scale low rank 
matrix decomposition. We provide a theoretical analysis in 
Section [V] that extends the rank-sparsity incoherence results in 
Chandrasekaran et al. GD We show that the proposed convex 
program decomposes the data matrix into its multi-scale com¬ 
ponents exactly under a deterministic incoherence condition. 
In addition, in Section [VlJ we provide a theoretical analysis on 
approximate multi-scale low rank matrix decomposition in the 
presence of additive noise that extends the work of Agarwal 
et al. (18). 

A major component of this paper is to introduce the 
proposed multi-scale low rank decomposition with emphasis 
on its practical performance and applications. We provide 
practical guidance on choosing regularization parameters for 
the convex method in Section JV| and describe heuristics 
to perform cycle spinning (19) to reduce blocking artifacts 
in Section [IX] In addition, we applied the multi-scale low 
rank decomposition on real datasets and considered four 
applications of the multi-scale low rank decomposition: il¬ 
lumination normalization for face images, motion separation 
for surveillance videos, compact modeling of the dynamic 
contrast enhanced magnetic resonance imaging and collab¬ 
orative filtering exploiting age information. (See Section [X] 
for more detail). Our results show that the proposed multi¬ 
scale low rank decomposition provides intuitive multi-scale 
decomposition and compact signal representation for a wide 
range of applications. 
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Fig. 1. An example of our proposed multi-scale low rank decomposition compared with other low rank methods. Each blob in the input matrix is a rank-1 
matrix constructed from an outer product of hanning windows. Only the multi-scale low rank decomposition exactly separates the blobs to their corresponding 
scales and represents each blob as compactly as possible. 


Related work 

Our proposed multi-scale low rank matrix decomposition 
draws many inspirations from recent developments in rank 
minimization (D 03 , p8| , 120]-|24]. In particular, the 
multi-scale low rank matrix decomposition is a generalization 
of the low rank + sparse decomposition proposed by Chan- 
drasekaran et al. [17] and Candes et al. (25). Our multi-scale 
low rank convex formulation also fits into the convex demixing 
framework proposed by McCoy et al. |26)-|28), who studied 
the problem of demixing components via convex optimization. 
The proposed multi-scale low rank decomposition can be 
viewed as a concrete and practical example of the convex 
demixing problem. However, their theoretical analysis assumes 
that each component is randomly oriented with respect to each 
other, and does not apply to our setting, where we observe the 
direct summation of the components. Bakshi et al. (29) pro¬ 
posed a multi-scale principal component analysis by applying 
principal component analysis on wavelet transformed signals, 
but such method implicitly constrains the signal to lie on a 
predefined wavelet subspace. Various multi-resolution matrix 
factorization techniques {30), ED were proposed to greedily 
peel off components of each scale by recursively applying 
matrix factorization. One disadvantage of these factorization 
methods is that it is not straightforward to incorporate them 
with other reconstruction problems as models. Similar multi¬ 
scale modeling using demographic information was also used 
in collaborative filtering described in Vozalis and Margari- 
tis (32). 

II. Multi-scale Low Rank Matrix Modeling 

In this section, we describe the proposed multi-scale low 
rank matrix modeling in detail. To concretely formulate the 
model, we assume that we can partition the data matrix of 
interest Y into different scales. Specifically, we assume that 
we are given a multi-scale partition {Pi}i =1 of the indices 
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Fig. 2. Illustration of a multi-scale matrix partition and its associated multi¬ 
scale low rank modeling. Since the zero matrix is a matrix with the least rank, 
our multi-scale modeling naturally extends to sparse matrices as 1 x 1 low 
rank matrices. 


of an M x N matrix, where each block b in Pi is an order 
magnitude larger than the blocks in the previous scale P{-\. 
Such multi-scale partition can be naturally obtained in many 
applications. For example in video processing, a multi-scale 
partition can be obtained by decimating both space and time 
dimensions. Figures [2] and [4] provide two examples of a 
multi-scale partition, the first one with decimation along two 
dimensions and the second one with decimation along one 
dimension. In Section |Xj we provide practical examples on 
creating these multi-scale partitions for different applications. 

To easily transform between the data matrix and the block 
matrices, we then consider a block reshape operator Rb(X ) 
that extracts a block b from the full matrix X and reshapes 
the block into an ra* x rti matrix (Figure [3]) , where x ru 
is the it h scale block matrix size determined by the user. 

Given an M x N input matrix Y and its corresponding 
multi-scale partition and block reshape operators, we propose 
a multi-scale low rank modeling that models the M x N input 
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Fig. 3. Illustration of the block reshape operator RR 5 extracts block b from 
the full matrix and reshapes it into an mi x ni matrix. Its adjoint operator 
Rj takes an ra* x ni matrix and embeds it into a full-size zero matrix. 


multi-scale components is beneficial for many applications and 
allows us to, for example, extract motions at multiple scales 
in surveillance videos (Section [X]). Since there are many more 
parameters to be estimated than the number of observations, it 
is necessary to impose conditions on 2Q. In particular, we will 
exploit the fact that each block matrix is low rank via a convex 


program, which will be described in detail in section III 


A. Multi-scale low rank + noise 


matrix Y as a sum of matrices J2i=i Xu in which each Xi is 
block-wise low rank with respect to its partition Pi, that is, 

Xi = Y^RJWbSbVj) 

be Pi 

where £4, S 6, and V& are matrices with sizes rrii x r^, x r & 
and rii x rb respectively and form the rank -r& reduced SVD of 
Rb(Xi). Note that when the rank of the block matrix Rb(Xi) 
is zero, we have {£4, S&, V&} as empty matrices, which do not 
contribute to Xi. Figure [2] and [4] provide illustrations of two 
kinds of modeling with their associated partitions. 

Multi-scale Matrix Partition 


Pi P 2 Pl- 1 Pl 


Before moving to the convex formulation, we note that 
our multi-scale matrix modeling can easily account for data 
matrices that are corrupted by additive white Gaussian noise. 
Under the multi-scale low rank modeling, we can think of the 
additive noise matrix as the largest scale signal component and 
is unstructured in any local scales. Specifically if we observe 
instead the following 

L 

Y = Y J X i + X z ( 1 ) 

i=1 

where Xz is an independent and identically distributed Gaus¬ 
sian noise matrix. Then we can define a reshape operator Rz 
that reshapes the entire matrix into an MN x 1 vector and 
the resulting matrix fits exactly to our multi-scale low rank 
model with L +1 scales. This incorporation of noise makes our 
model flexible in that it automatically provides a corresponding 
convex relaxation, a regularization parameter for the noise 
matrix and allows us to utilize the same iterative algorithm 
to solve for the noise matrix. Figure [5] provides an example 
of the noisy multi-scale low rank matrix decomposition. 


Multi-scale Low Rank Modeling 
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Fig. 4. Illustration of another multi-scale matrix partition and its associated 
multi-scale low rank modeling. Here, only the vertical dimension of the matrix 
is decimated. Since a 1 x N matrix is low rank if and only if it is zero, our 
multi-scale modeling naturally extends to group sparse matrices. 
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Fig. 5. An example of the multi-scale low rank decomposition in the presense 
of additive Gaussian noise by solving the convex program {2}. 


By constraining each block matrices to be of low rank, 
the multi-scale low rank modeling captures the notion that 
some nearby entries are more similar to each other than global 
entries in the data matrix. We note that the multi-scale low 
rank modeling is a generalization of the low rank + sparse 
modeling proposed by Chandrasekaren et al. ED and Candes 
et al. (25). In particular, the low rank + sparse modeling can 
be viewed as a 2-scale low rank modeling, in which the first 
scale has block size lxl and the second scale has block size 
M x N. By adding additional scales between the sparse and 
globally low rank matrices, the multi-scale low rank modeling 
can capture locally low rank components that would otherwise 
need many coefficients to represent for low rank + sparse. 

Given a data matrix Y that fits our multi-scale low rank 
model, our goal is to decompose the data matrix Y to its 
multi-scale components {Xi}^ =1 . The ability to recover these 


III. Problem Formulation and Convex Relaxation 

Given a data matrix Y that fits the multi-scale low rank 
model, our goal is to recover the underlying multi-scale com¬ 
ponents {Xi}f =1 using the fact that Xi is block-wise low rank. 
Ideally, we would like to obtain a multi-scale decomposition 
with the minimal block matrix rank and solve a problem 
similar to the following form: 

L 

minimize rank (Rb(Xi)) 

i =1 bePi 
L 

subject to Y = ^ Xi 

i= 1 

However, each rank minimization for each block is combi¬ 
natorial in nature. In addition, it is not obvious whether the 
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direct summation of ranks is a correct formulation as a 1- 
sparse matrix and a rank-1 matrix should intuitively not carry 
the same cost. Hence, the above non-convex problem is not a 
practical formulation to obtain the multi-scale decomposition. 

Recent development in convex relaxations suggests that 
rank minimization problems can often be relaxed to a convex 
program via nuclear norm relaxation [131, [23|, while still 
recovering the optimal solution to the original problem. In 
particular, Chandrasekaren et al. ED and Candes et al., (25) 
showed that a low rank + sparse decomposition can be relaxed 
to a convex program by minimizing a nuclear norm + £1- 
norm objective as long as the signal constituents are incoherent 
with respect to each other. In addition, Candes et al., (25) 
showed that the regularization parameters for sparsity and low 
rank should be related by the square root of the matrix size. 
Hence, there is hope that, along the same line, we can perform 
the multi-scale low rank decomposition exactly via a convex 
formulation. 

Concretely, let us define || • || nu c to be the nuclear norm, the 
sum of singular values, and || • || msv be the maximum singular 
value norm. For each scale i , we consider the block-wise 
nuclear norm to be the convex surrogate for the block-wise 
ranks and define || • ||(^ the block-wise nuclear norm for the 
ith scale as 

11*1(0 = J2 ll- R *>(-)ll„uc 
bePi 

Its associated dual norm || • ||*^ is then given by 

II • ll(i) = max||i? b (-)|| msv 

v J be Pi 

which is the maximum of all block-wise maximum singular 
values. 

We then consider the following convex relaxation for our 
multi-scale low rank decomposition problem: 

L 

minimize VAi||Xi|| (i) 

4l r ...,AL -- 

l < 2 ) 

subject to Y = Xi 

i= 1 

where {A i}f =1 are the regularization parameters and their 
selection will be described in detail in section El 

Our convex formulation is a natural generalization of the 
low rank + sparse convex formulation (T7), (25). With the 
two sided matrix partition (Fig. [2), the nuclear norm applied 
to the lxl blocks becomes the element-wise fT-norm and 
the norm for the largest scale is the nuclear norm. With the 
one sided matrix partition (Fig. [4), the nuclear norm applied 
to 1 x N blocks becomes the group-sparse norm and can 
be seen as a generalization of the group sparse + low rank 
decomposition (D If we incorporate additive Gaussian noise 
in our model as described in Section |II-A[ then we have a 
nuclear norm applied to an MN x 1 vector, which is equivalent 
to the Frobenius norm. 

One should hope that the theoretical conditions from low 
rank + sparse can be generalized rather seamlessly to the 
multi-scale counterpart. Indeed, in Section [VJ we show that 


the core theoretical guarantees in the work of Chandrasekaren 
et al. ED on exact low rank + sparse decomposition can be 
generalized to the multi-scale setting. In section [VI] we show 
that the core theoretical guarantees in the work of Agarwal et 
al. fl8| l on noisy matrix decomposition can be generalized 
to the multi-scale setting as well to provide approximate 
decomposition guarentees. 


IV. Guidance on Choosing Regularization 
Parameters 

In this section, we provide practical guidance on selecting 
the regularization parameters {A i}f =1 - Selecting the regu¬ 
larization parameters {A i}f =1 is crucial for the convex de¬ 
composition to succeed, both theoretically and practically. 
While theoretically we can establish criteria on selecting 
the regularization parameters (see Section |V| and [VI]), such 
parameters are not straightforward to calculate in practice as 
it requires properties of the signal components {Xi}f =1 before 
the decomposition. 

To select the regularization parameters {A*}^ in practice, 
we follow the suggestions from Wright et al. (33) and Fogel 
et al. 134), and set each regularization parameter A^ to be the 
Gaussian complexity of each norm || • ||^\, which is defined as 
the expectation of the dual norm of random Gaussian matrix: 


Ai - UIIGiy (3) 


where ^ denotes equality up to some constant and G is a 
unit-variance independent and identically distributed random 
Gaussian matrix. 

The resulting expression for the Gaussian complexity is the 
maximum singular value of a random Gaussian matrix, which 
has been studied extensively by Bandeira and Handel (35) . The 
recommended regularization parameter for scale i is given by 



(4) 


For the sparse matrix scale with lxl block size, A i ~ 
^/log (MN) and for the globally low rank scale with M x N 
block size, A i ~ \/M + y/~N. Hence this regularization 
parameter selection is consistent with the ones recommended 
for low rank + sparse decomposition by Candes et al. (25), up 
to a log factor. In addition, for the noise matrix with MN x 1 
block size, A^ ^ y/MN, which has similar scaling as in square 
root LASSO (36) . In practice, we found that the suggested 
regularization parameter selection allows exact multi-scale 
decomposition when the signal model is matched (for example 
Figure [lj and provides visually intuitive decomposition for real 
datasets. 

For approximate multi-scale low rank decomposition in the 
presence of additive noise, some form of theoretical guarantees 
for the regularization selection can be found in our analysis in 
Section 0 In particular, we show that if the regularization 
parameters A^ is larger than the Gaussian complexity of 
|| • ||*.) in addition to some “spikiness" parameters, then the 
error between recovered decomposition and the ground truth 
{Xi}f =1 is bounded by the block-wise matrix rank. 
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V. Theoretical Analysis for Exact Decomposition 

In this section, we provide a theoretical analysis of the pro¬ 
posed convex formulation and show that if {Xi}f =1 satisfies a 
deterministic incoherence condition, then the proposed convex 
formulation ^ recovers {Xi}f =1 from Y exactly. 

Our analysis follows similar arguments taken by Chan- 
drasekaren et al. (T7J on low rank + sparse decomposition 
and generalizes them to the proposed multi-scale low rank de¬ 


composition. Before showing our main result (Theorem V.l), 
we first describe the subgradients of our objective function 


(Section V-A) and define a coherence parameter in terms of 


the block-wise row and column spaces (Section V-B}. 


A. Subdifferentials of the block-wise nuclear norms 

To characterize the optimality of our convex problem, we 
first look at the subgradients of our objective function. We 
recall that for any matrix X with {U,S,V} as its reduced 
SVD representation, the subdifferential of || • || nuc at X is given 
by (23), (37), 

<9||AT|| nuc = { UV T + W : W and X have orthogonal row 

and column spaces and ||VE|| msv < 11 

Now recall that we define the block-wise nuclear norm to be 
II ' ll(i) = ^2bePi 11 (’) 11 nuc • Then using the chain rule and the 
fact that Rb(Xi) = UbSbV^ , we obtain an expression for the 
subdifferential of || • ||(^ at Xi as follows: 

3||*i||(i> = j Y R b (■ UbV b + Wb ) ■ lVh and Rb (*0 have 

IbePi 

orthogonal row and column spaces and ||W&|| msv < 1 j> 

To simplify our notation, we define Ei = 
beP■ (JJbVb ) and Ti 1° be a vector space that 
contains matrices with the same block-wise row spaces or 
column spaces as Xi, that is, 

t* = \y: nJ(u h xj + nn T ): ^ e c n * xr sn c 

IbePi 

where rrii x m is the size of the block matrices for scale i and 
rb is the matrix rank for block b. Then, the subdifferential of 
each || • ||(^) at Xi can be compactly represented as, 

0||*i||(i) = {Ei +W i :W i eT l ± and ||^||^ } < l} 

We note that Ei can be thought of as the “sign" of the matrix 
Xi , pointing toward the principal components, and, in the case 
of the sparse scale, is exactly the sign of the entries. 

In the rest of the section, we will be interested in projecting 
a matrix X onto Ti , which can be performed with the 
following operation: 

v Ti (x) = E R b ( UbU b + Rb(x)v b v? 

be Pi 

-UhUjRbO()V b V b T ) 


Similarly, to project a matrix X onto the orthogonal comple¬ 
ment of Ti , we can apply the following operation: 

V Tt {X) = ]T Rj ((/ - U b Uj)R b (X)(I - V b V b T )) 
be Pi 

where I is an appropriately sized identity matrix. 


B. Incoherence 

Following Chandrasekaren et al. (T7), we consider a de¬ 
terministic measure of incoherence through the block-wise 
column and row spaces of Xi. Concretely, we define the 
coherence parameter for the jth scale signal component Xj 
with respect to the ith scale to be the following: 


ii a = max ||W| 
NjeTj, ||N j ||*. ) <l 


(0 


(5) 


Using Hij as a measure of incoherence, we can quantita¬ 
tively say that the jth scale signal component is incoherent 
with respect to the Ah scale if j±ij is small. In the case of low 
rank + sparse, Chandrasekaren et al. G3 provides excellent 
description of the concepts behind the coherence parameters. 
We refer the reader to their paper for more detail. 


C. Main Result 

Given the above definition of incoherence, the following 
theorem states our main result for exact multi-scale low rank 
decomposition: 


Theorem V.l. If we can choose regularization parameters 
{A i}f =1 such that 


^ for i = l, 




.,L 


then {Xi}f =1 is the unique optimizer of the proposed convex 
problem 

In particular when the number of scales L = 2, the condition 
on {/ii 2 ,A 2 i} reduces to 1112^21 < 1/4 and the condition on 
{Ai, A 2 } reduces to 2 fii 2 < A 1 /A 2 < l/( 2 /i 2 i), which is in 
similar form as Theorem 2 in Chandrasekaren et al. ED- 


The proof for the above theorem is given in Appendix [A] 


VI. Theoretical Analysis for Approximate 
Decomposition 

In this section, we provide a theoretical analysis for 
approximate multi-scale low rank decomposition when the 
measurement is corrupted by additive noise as described in 


Section II-A Our result follows arguments from Agarwal et 
al. fl 8 | on noisy 2 -scale matrix decomposition and extends it 
to the multi-scale setting. 

Instead of using the incoherence parameter /i 2J defined for 
the exact decomposition analysis in Section [VJ we opt for 
a weaker characterization of incoherence between scales for 
approximate decomposition, studied in Agarwal et al. fl 8 ). 


Concretely, we consider spikiness parameters ctij between the 
jth signal component Xj and Ah scale norm |j • ||(^ such that, 

ay = IIAIleo 
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for each j ^ i. Hence, if otij is small, we say Xj is not spiky 
with respect to the ith norm. 

For analysis purpose, we also impose the constraints 
||Xjj)^ < OLij in the convex program. That is, we consider 
the solution from the following convex program: 

L 

minimize V A* ||X* || (i) + A 2 \\X Z ||f ro ( 6 ) 

i=l 

L 

subject to Y = ^ Xi + Xz 

i= 1 

11^-11(0 <«« for j*i (7) 

We emphasize that the additional constraints 0 are imposed 
only for the purpose of theoretical analysis and are not 
imposed in our experimental results. In particular, for our 
simulation example in Figure [5j the minimizer of the convex 
program Q, using the recommended regularization parameters 
in Section [IV| satisfied the constraints 0 even when the 
constraints were not imposed. 

Let us define {A i}f =1 and A z to be the errors between the 
ground truth components {Xi}f =1 and Xz and the minimizers 
of convex program 0 . Then, equivalently, we can denote 
{Xi + A i)i =1 and Xz + A z as the minimizers of the convex 
program 0 . The following theorem states our main result for 
approximate decomposition. 


Theorem VI.l. If we choose {A i}f =1 such that 


A i > 2A z 


WzWh 

11 -Vs 11 fro 


+ 2 ai i 


( 8 ) 


Alternating Direction of Multiple Multipliers (ADMM) j38). 
While the proposed convex formulation 0 can be formulated 
into a semi-definite program, first-order iterative methods are 
commonly used when solving for large datasets for their com¬ 
putational efficiency and scalability. A conceptual illustration 
of the algorithm is shown in Figure [ 6 ] 
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Fig. 6. A conceptual illustration of how to obtain a multi-scale low rank 
decomposition. First, we extract each block from the input matrix and perform 
a thresholding operation on its singular value to recover the significant compo¬ 
nents. Then, we subtract these significant components from our input matrix, 
thereby enabling the recovery of weaker, previously submerged components. 


To formally obtain update steps using ADMM, we first 
formulate the problem into the standard ADMM form with 
two separable objectives connected by an equality constraint, 


and A^ such that 


^z > 


\ 


64 J2 x i Y r b 

i=i be Pi 


then the error is bounded by 

L 


£iia 


i 11 fro 


< 


JIM. 

Air 


fro 


Y^JY^ 

be Pi 


(9) 


i=1 i=1 

where < denotes inequality up to a universal constant. 


Hence, when the spikiness parameters are negligible and 
Xz = crG , where G is an independent, identically distributed 
Gaussian noise matrix with unit variance and a is the noise 
standard deviation, choosing A^: ~ ^[||G||f ro ] ^ VMN and 
Ai ~ ^[IIGII*;)] ~ + \/log(MV/max(m j , n»)) 

ensures the condition is satisfied with high probability. This 
motivates the recommended regularization selection in Sec¬ 
tion [TV] 

The proof for the above theorem is given in Appendix [B] and 
follows arguments from Agarwal et al. fl 8 | on noisy matrix 
decomposition and Belloni et al. p 6 | on square root LASSO. 


VII. An Iterative Algorithm for Solving the 
Multi-scale Low Rank Decomposition 

In the following, we will derive an iterative algorithm that 
solves for the multi-scale low rank decomposition via the 


minimize I < Y = Xi > + AdlZdlf^ 

^ l k \ k w < 10 > 

subject to Xi = Zi 

where I{*} is the indicator function. 

To proceed, we then need to obtain the proximal opera¬ 
tors 1391 for the two objective functions I{V = J2i=i^i} 
and J2i=i Ai || (i) • For the data consistency objective I {Y = 
Y^=i Xi}, the proximal operator is simply the projection 
operator to the set. To obtain the proximal operator for the 
multi-scale nuclear norm objective Y^=i Ai||^||(i), we fi rst 
recall that the proximal operator for the nuclear norm ||X|| nuc 
with parameter A is given by the singular value soft-threshold 
operator (23| , 

SVT a (X) = u max(£-A,0 )V t (11) 

Since we defined the block-wise nuclear norm for each scale 
i as ^2 beP . 11 Rb (’) 11 nuc> the norm is separable with respect to 
each block and its proximal function with parameter A^ is 
given by the block-wise singular value soft-threshold operator, 

Blocks VT Ai (V) = ]T Rj (SVT Xi (R b (X))) (12) 

bePi 

which simply extracts every blocks in the matrix, performs 
singular value thresholding and puts the blocks back to the 
matrix. We note that for 1 x 1 blocks, the block-wise singular 
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value soft-threshold operator reduces to the element-wise soft- 
threshold operator and for 1 x TV blocks, the block-wise 
singular soft-threshold operator reduces to the joint soft- 
threshold operator. 

Putting everything together and invoking the ADMM 
recipe (38) , we have the following algorithm to solve our 
convex multi-scale low rank decomposition 

Xi <- (Zt -Lh) + \(Y- - Ui) 

Zi <- BlockSVT a . /p (Xi + Ui) 

Ui^Ui- (Zi - Xi) 

where p is the ADMM parameter that only affects the conver¬ 
gence rate of the algorithm. 

The resulting ADMM update steps are similar in essence to 
the intuitive update steps in Figure [ 6 ] and alternates between 
data consistency and enforcing multi-scale low rank. The 
major difference of ADMM is that it adds a dual update 
step with Ui, which bridges the two objectives and ensures 
the convergence to the optimal solution. Under the guarantees 
of ADMM, in the limit of iterations, X t and Z % converge 
to the optimal solution of the convex program and Ui 
converges to a scaled version of the dual variable. In practice, 
we found that ^ 1000 iterations are sufficient without any 
visible change for imaging applications. Finally, we note 
that because the proximal operator for the multi-scale nuclear 
norm is computationally simple, other proximal operator based 
algorithms (39) can also be used. 

VIII. Computational Complexity 

Given the iterative algorithm ( [13] ), one concern about the 
multi-scale low rank decomposition might be that it is signif¬ 
icantly more computationally intensive than other low rank 
methods as we have many more SVD’s and variables to 
compute for. In this section, we show that because we decimate 
the matrices at each scale geometrically, the theoretical compu¬ 
tational complexity of the multi-scale low rank decomposition 
is similar to other low rank decomposition methods, such as 
the low rank + sparse decomposition. 

For concreteness, let us consider the multi-scale partition 
with two-sided decimation shown in Figure [2] and have block 
sizes mi — 2 l ~ 1 and ru = 2 l ~ 1 . Similar to other low rank 
methods, the SVD’s dominate the per iteration complexity 
for the multi-scale low rank decomposition. For an M x N 
matrix, each SVD costs #f lops(M x N SVD) = 0(MN 2 ). 
The per iteration complexity for the multi-scale low rank 
decomposition is dominated by the summation of all the 
SVD’s performed for each scale, which is given by, 

#f lops (M x N SVD) + 4 #f lops (M/2 x N/2 SVD) + 
= 0(MN 2 ) + 0(MN 2 )/2 + 0(MN 2 )/A + ... 

< 2 0(MN 2 ) « #flops(M x N SVD) 

(14) 

Hence, the per-iteration computational complexity of the 
multi-scale low rank with two-sided decimated partition is 



on the order of a M x N matrix SVD. In general, one can 
show that the per-iteration complexity for arbitrary multi-scale 
partition is at most log (TV) times the full matrix SVD. 

While theoretically, the computation cost for small block 
sizes should be less than bigger block sizes, we found that in 
practice the computation cost for computing the small SVD’s 
can dominate the per-iteration computation. This is due to the 
overhead of copying small block matrices and calling library 
functions repeatedly to compute the SVD’s. 

Since we are interested in thresholding the singular values 
and in practice many of the small block matrices are zero as 
shown in Section [X] one trick of reducing the computation 
time is to quickly compute an upper bound on the maximum 
singular value for block matrices before the SVD’s. Then if 
the upper bound for the maximum singular value is less than 
the threshold, we know the thresholded matrix will be zero 
and can avoid computing the SVD. Since for any matrix X , 
its maximum singular value is bounded by the square root 
of any matrix norm on X T X (40) , there are many different 
upper bounds that we can use. In particular, we choose the 
maximum row norm and consider the following upper bound, 


c ( x ) ^ W m f x X \XikXjk | 


(15) 


Using this upper bound, we can identify many below-the- 
threshold matrices before computing the SVD’s at all. In 
practice, we found that the above trick provides a modest 
speedup of 3 ~ 5 x. 


IX. Heuristics for translation invariant 

DECOMPOSITION 

Similar to wavelet transforms, one drawback of the multi¬ 
scale low rank decomposition is that it is not translation 
invariant, that is, shifting the input changes the resulting 
decomposition. In practice, this translation variant nature often 
creates blocking artifacts near the block boundaries, which 
can be visually jarring for image or video applications. One 
solution to remove these artifacts is to introduce overlapping 
partitions of the matrix so that the overall algorithm is transla¬ 
tion invariant. However, this vastly increases both memory and 
computation especially for large block sizes. In the following, 
we will describe a cycle spinning approach that we used 
in practice to reduce the blocking artifacts with only slight 
increase in per-iteration computation. 

Cycle spinning G2) has been commonly used in wavelet 
denoising to reduce the blocking artifacts due to the translation 
variant nature of the wavelet transform. To minimize artifacts, 
cycle spinning averages the denoised results from all possible 
shifted copies of the input, thereby making the entire process 
translation invariant. Concretely, let S be the set of all shifts 
possible in the target application, Shift s denote the shifting 
operator by s , and Denoise be the denoising operator of 
interest. Then the cycle spinned denoising of the input X is 
given by: 

A Y Shift_ s (Denoise(Shift s (X))) (16 ) 
1*1 
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Fig. 7. An example of the multi-scale low rank decomposition with and 
without random cycle spinning. Each blob in the input matrix Y is a rank-1 
matrix constructed from an outer product of hanning windows and is placed at 
random positions. Blocking artifacts can be seen in the decomposition without 
random cycle spinning while vastly diminished in the random cycle spinned 
decomposition. 


In the context of multi-scale low rank decomposition, we can 
make the iterative algorithm translation invariant by replacing 
the block-wise singular value thresholding operation in each 
iteration with its cycle spinning counterpart. In particular, for 
our ADMM update steps, we can replace the Zi step to: 

X t4t 57 Shift_ s (Blocks VT A . /p (SHiFT s (Xj + £/*))) 

I I ses 

(17) 

To further reduce computation, we perform random cycle 
spinning in each iteration as described in Figueiredo et al. ED, 
in which we randomly shifts the input, performs block-wise 
singular value thresholding and then unshifts back: 

Zi <r- Shift_ s (BlocksVT A . / p (SHiFT s (X i + Uf})) (18) 

where s is randomly chosen from the set S. 

Using random cycle spinning, blocking artifacts caused 
by thresholding are averaged over iterations and in practice, 
reduces distortion significantly. Figure [7] shows an example 
of the multi-scale low rank decomposition with and without 
random cycle spinning applied on a simulated data that does 
not fall on the partition grid. The decomposition with random 
cycle spinning vastly reduces blocking artifacts that appeared 
in the one without random cycle spinning. 


X. Applications 


To test for practical performance, we applied the multi¬ 
scale low rank decomposition on four different real datasets 
that are conventionally used in low rank modeling: illumi¬ 
nation normalization for face images (Sect ion |X-A| ), motion 
separation for surveillance videos (Section |~X-B }, multi-scale 
modeling of dynamic contrast enhanced magnetic resonance 
imaging (Section |X-Q and collaborative filtering exploiting 
age information (Section |X-D| ). We compared our proposed 
multi-scale low rank decomposition with low rank + sparse 
decomposition for the first three applications and with low 
rank matrix completion for the last application. Randomly cy¬ 
cle spinning was used for multi-scale low rank decomposition 
for all of our experiments. Regularization parameters A i were 
chosen exactly as y/ml + y/nf + yJ\og{MN j max(ra*, r^)) 
for multi-scale low rank and max(m^ni) for low rank + 


sparse decomposition. Our simulations were written in the 
C programming language and ran on a 20-core Intel Xeon 
workstation. Some results are best viewed in video format, 
which are available as supplementary materials. 

In the spirit of reproducible research, we provide a software 
package (in C and partially in MATLAB) to reproduce most 
of the results described in this paper. The software package 
can be downloaded from: 

https://github.com/frankong/multi_scale_low_rank.git 


A. Multi-scale Illumination Normalization for Face Recogni¬ 
tion Pre-processing 

Face recognition algorithms are sensitive to shadows or 
occlusions on faces. In order to obtain the best possible 
performance for these algorithms, it is desired to remove 
illumination variations and shadows on the face images. Low 
rank modeling are often used to model faces and is justified 
by approximating faces as convex Lambertian surfaces ED- 

Low rank + sparse decomposition [[25) was recently pro¬ 
posed to capture uneven illumination as sparse errors and 
was shown to remove small shadows while capturing the 
underlying faces as the low rank component. However, most 
shadows are not sparse and contain structure over different 
lighting conditions. Here, we propose modeling shadows and 
illumination changes in different face images as block-low 
rank as illumination variations are spatially correlated in 
multiple scales. 

We considered face images from the Yale B face 
database (42]]. Each face image was of size 192 x 168 with 64 
different lighting conditions. The images were then reshaped 
into a 32, 256 x 64 matrix and both multi-scale low rank and 
low rank + sparse decomposition were applied on the data 
matrix. For low rank + sparse decomposition, we found that 
the best separation result was obtained when each face image 
was normalized to the maximum value. For multi-scale low 
rank decomposition, the original unsealed image was used. 
Only the space dimension was decimated as we assumed 
there was no ordering in different illumination conditions. The 
multi-scale matrix partition can be visualized as in Figure [4] 

Figure [8] shows one of the comparison results. Multi-scale 
low rank decomposition recovered almost shadow-free faces. 
In particular, the sparkles in the eyes were represented in 
the 1 x 1 block size and the larger illumination changes 
were represented in bigger blocks, thus capturing most of the 
uneven illumination changes. In contrast, low rank + sparse 
decomposition could only recover from small illumination 
changes and still contained the larger shadows in the globally 
low rank component. 


B. Multi-scale Motion Separation for Surveillance Videos 

In surveillance video processing, it is desired to extract 
foreground objects from the video. To be able to extract 
foreground objects, both the background and the foreground 
dynamics have to be modeled. Low rank modeling have been 
shown to be suitable for slowly varying videos, such as 
background illumination changes. In particular, if the video 
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Fig. 8. Multi-scale low rank versus low rank + sparse on faces with uneven illumination. Multi-scale low rank decomposition recovers almost shadow-free 
faces, whereas low rank + sparse decomposition can only remove some shadows. 



Fig. 9. Multi-scale low rank versus low rank + sparse decomposition on a surveillance video. For the multi-scale low rank, body motion is mostly captured in 
the 16 x 16 x 16 scale while fine-scale motion is captured in 4 x 4 x 4 scale. Background video component is captured in the globally low rank component 
and is almost artifact-free. Low rank + sparse decomposition exhibits ghosting artifacts as pointed by the red arrow because they are neither globally low 
rank or sparse. 


background only changes its brightness over time, then it can 
be represented as a rank-1 matrix. 

Low rank + sparse decomposition (25) was proposed to 
foreground objects as sparse components and was shown to 
separate dynamics from background components. However, 
sparsity alone cannot capture motion compactly and often 
results in ghosting artifacts occurring around the foreground 
objects as shown in Figure [9] Since video dynamics are 
correlated locally at multiple scales in space and time, we 
propose using the multi-scale low rank modeling with two 
sided decimation to capture different scales of video dynamics 
over space and time. 

We considered a surveillance video from Li et al. (43) . Each 
video frame was of size 144 x 176 and the first 200 frames were 
used. The video frames were then reshaped into a 25, 344 x 200 


matrix and both multi-scale low rank and low rank + sparse 
decomposition were applied on the data matrix. 

Figure [9] shows one of the results. Multi-scale low rank de¬ 
composition recovered a mostly artifact free background video 
in the globally low rank component whereas low rank + sparse 
decomposition exhibits ghosting artifact in certain segments of 
the video. For the multi-scale low rank decomposition, body 
motion was mostly captured in the 16 x 16 x 16 scale while 
fine-scale motion was captured in 4 x 4 x 4 scale. 

C. Multi-scale Low Rank Modeling for Dynamic Contrast 
Enhanced Magnetic Resonance Imaging 

In dynamic contrast enhanced magnetic resonance imaging 
(DCE-MRI), a series of images over time is acquired after a T\ 
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Fig. 10. Multi-scale low rank versus low rank + sparse decomposition on a dynamic contrast enhanced magnetic resonance image series. For the multi-scale 
result, small contrast dynamics in vessels are captured in 4 x 4 blocks while contrast dynamics in the liver are captured in 16 x 16 blocks. The biggest block 
size captures the static tissues and interestingly the respiratory motion. In contrast, the low rank + sparse modeling could only provide a coarse separation of 
dynamics and static tissue, which result in neither truly sparse nor truly low rank components. 


contrast agent was injected into the patient. Different tissues 
then exhibit different contrast dynamics over time, thereby 
allowing radiologists to characterize and examine lesions. 
Compressed sensing Magnetic Resonance Imaging (44) is now 
a popular research approach used in three dimensional DCE- 
MRI to speed up acquisition. Since the more compact we 
can represent the image series, the better our compressed 
reconstruction result becomes, an accurate modeling of the 
dynamic image series is desired to improve the compressed 
sensing reconstruction results for DCE-MRI. 

When a region contains only one type of tissue, then the 
block matrix constructed by stacking each frame as columns 
will have rank 1. Hence, low rank modeling (TOj, and locally 
low rank modeling j45j have been popular models for DCE- 
MRI. Recently, low rank + sparse modeling 63 have also 
been proposed to model the static background and dynamics 
as low rank and sparse matrices respectively. However, dy¬ 
namics in DCE-MRI are almost never sparse and often exhibit 
correlation across different scales. Hence, we propose using a 
multi-scale low rank modeling to capture contrast dynamics 
over multiple scales. 

We considered a fully sampled dynamic contrast enhanced 
image data. The data was acquired in a pediatric patient with 
20 contrast phases, 1 x 1.4 x 2 mm 3 resolution, and 8s temporal 
resolution. The acquisition was performed on a 3T GE MR750 
scanner with a 32-channel cardiac array using an RF-spoiled 
gradient-echo sequence. We considered a 2D slice of size 
154 x 112 were then reshaped into a 17, 248 x 20 matrix. Both 
multi-scale low rank and low rank + sparse decomposition 
were applied on the data matrix. 

Figure [TO] shows one of the results. In the multi-scale low 
rank decomposition result, small contrast dynamics in vessels 
were captured in 4 x 4 blocks while contrast dynamics in 
the liver were captured in 16 x 16 blocks. The biggest block 
size captured the static tissues and interestingly the respiratory 


motion. Hence, different types of contrast dynamics were 
captured compactly in their suitable scales. In contrast, the low 
rank + sparse modeling could only provide a coarse separation 
of dynamics and static tissue, which resulted in neither truly 
sparse nor truly low rank components. 


D. Multi-scale Age Grouping for Collaborative Filtering 



Star Wars (1977) 




Users sorted by age 


Gone with the Wind (1939) 




Users sorted by age 


Age group size 


Reconstructed 


Users sorted by age 


Fig. 11. Multi-scale low rank reconstructed matrix of the 100K MovieLens 
dataset. The extracted signal scale component captures the tendency that 
younger users rated Star Wars higher whereas the more senior users rated 
Gone with the Wind higher. 


Collaborative filtering is the task of making predictions 
about the interests of a user using available information from 
all users. Since users often have similar taste for the same 
item, low rank modeling is commonly used to exploit the data 
similarity to complete the rating matrix (M), (22), (23). On 
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the other hand, low rank matrix completion does not exploit 
the fact that users with similar demographic backgrounds have 
similar taste for similar items. In particular, users of similar 
age should have similar taste. Hence, we incorporated the pro¬ 
posed multi-scale low rank modeling with matrix completion 
by partitioning users according to their age and compared it 
with the conventional low rank matrix completion. Our method 
belongs to the general class of collaborative filtering methods 
that utilize demographic information j32j. 

To incorporate multi-scale low rank modeling into matrix 
completion, we change the data consistency constraint in prob¬ 
lem ^ to [Y]jk = E Xi]jk for observed jk entries, and 
correspondingly, the update step for {Xi}f =1 in equation (l3| 
is changed to [Xi\ jk <- [{Zi~Ui) + ^(Y -f l f =1 (Z i -U i ))]jk 
for observed jk entries and [Xj\jk <— [Zi — Ui\jk for 
unobserved jk entries. We emphasize that our theoretical 
analysis does not cover matrix completion and the presented 
collaborative filtering application is mainly of empirical inter¬ 
est. 

To compare the methods, we considered the 100K Movie- 
Lens dataset, in which 943 users rated 1682 movies. The 
resulting matrix was of size 1682 x 943, where the first 
dimension represented movies and the second dimension rep¬ 
resented users. The entire matrix had 93.7% missing entries. 
Test data was further generated by randomly undersampling 
the rating matrix by 5. The algorithms were then run on the 
test data and root mean squared errors were calculated over 
all available entries. To obtain a multi-scale partition of the 
matrix, we sorted the users according to their age along the 
second dimension and partitioned them evenly into age groups. 

Figure |TT] shows a multi-scale low rank reconstructed user 
rating matrix. Using multiple scales of block-wise low rank 
matrices, correlations in different age groups were captured. 
For example, one of the scales shown in Figure [TT] captures the 
tendency that younger users rated Star Wars higher whereas the 
more senior users rated Gone with the Wind higher. The multi¬ 
scale low rank reconstructed matrix achieved a root mean- 
squared-error of 0.9385 compared to a root mean-squared- 
error of 0.9552 for the low rank reconstructed matrix. 

XI. Discussion 

We have presented a multi-scale low rank matrix decom¬ 
position method that combines both multi-scale modeling 
and low rank matrix decomposition. Using a convex formu¬ 
lation, we can solve for the decomposition efficiently and 
exactly, provided that the multi-scale signal components are 
incoherent. We provided a theoretical analysis of the convex 
relaxation for exact decomposition, which extends the analysis 
in Chandrasekaren et al. gt and an analysis for approximate 
decomposition in the presence of additive noise, which extends 
the analysis in Agarwal et al. {18). We also provided empirical 
results that the multi-scale low rank decomposition performs 
well on real datasets. 

We would also like to emphasize that our recommended 
regularization parameters empirically perform well even with 
the addition of noise, and hence in practice does not require 
manual tuning. While some form of theoretical guarantees for 


the regularization parameters are provided in the approximate 
decomposition analysis, complete theoretical guarentees are 
not provided, especially for noiseless situations, and would be 
valuable for future work. 

Our experiments show that the multi-scale low rank decom¬ 
position improves upon the low rank + sparse decomposition in 
a variety of applications. We believe that more improvement 
can be achieved if domain knowledge for each applications 
is incorporated with the multi-scale low rank decomposition. 
For example, for face shadow removal, prior knowledge of 
the illumination angle might be able to provide a better multi¬ 
scale partition. For movie rating collaborative filtering, general 
demographic information and movie types can be used to 
construct multi-scale partitions in addition to age information. 

Appendix A 
Proof of Theorem IV. II 

In this section, we provide a proof of Theorem [V.l| and show 
that if {Xi}f =1 satisfies a deterministic incoherence condition, 
then the proposed convex formulation ^ recovers {Xi}f =1 
from Y exactly. Our proof makes use of the dual certificate 
common in such proofs. We will begin by proving a technical 
lemma collecting three inequalities. 

Lemma A.l. For i = 1, ..., L, the following three inequalities 
hold, 

\\V Ti {X)\\* {i) < ||X||fo for any matrix X (19) 

||72pj-(X)||*^ < ||X||*-) for any matrix A (20) 

II^IICO <Mij \\Nj llfo for j % . and N 0 £ T 0 (21) 

Proof. To show the first inequality ( p~9] ), we recall that 
|| X ||jU = maXfr G p. ||i2b(A)|| msv . Then, using the variational 
representation of the maximum singular value norm, we ob¬ 
tain, 

\\V Ti { x )\\h = max max u T R b (V Ti (X))v 

v J b£Pi u,v 

= max max u T R b (X)v 

bePi ueco\(Rb(Xi )) or 
vGrow (R b (Xi)) 

< max max u r R b (X)v = ||X||/-x 

b£Pi u,v ^ ' 

where col and row denote the column and row spaces respec¬ 
tively. 

Similarly, we obtain the second inequality ( [20] ): 

||7 ? t x(X)||^ =max max u T R b (X)v 

bEPi uEcoU (R b (Xi)) and 
vErow ± (R b (Xi)) 

< max max u T R b (X)v = ||X||jU 

bePi u,v v ' 

The third inequality © follows from the incoherence 
definition that fiij > \\Nj ||*^/|| Nj ||*^ for any non-zero Nj. 

□ 

Next, we will show that if we can choose some parameters 
to “balance" the coherence between the scales, then the block- 
wise row/column spaces {Ti}f =1 are independent, that is 
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Y^i= i G is a direct sum. Consequently, each matrix N in the 
span of {Ti}f =1 has a unique decomposition N = 1 

where Ni G 7E 

Proposition A.2. If we can choose some positive parameters 
{A;}^ such that 

E^'V <:L ’ = ( 22 ) 

. / . A 


definition of subgradient and the fact that A^ = 0, we 

have, 

L L 

E Ai \\Xi + Aill (i) > E Ai \\Xi II w + (Ai, Gi) 

i= 1 i= 1 

L 

= ^ 5 Gi) ~ (Ai , Q) 

2 = 1 


then we have 

Ti n Tj = { 0 }, for i = 1 ,..., L (23) 

i/* 

In particular when L = 2, the condition on {/ii 2 ,M 2 i} 
reduces to H 12 H 21 < 1, which coincides with Proposition 1 in 
Chandrasekaren et al. 03- We also note that given fiij , we can 
obtain {Ai}^ that satisfies the condition J2j^i Tij^j < ^2 
by solving a linear program. 

Proof. Suppose by contradiction that there exists {A i}f =1 such 
that Y,j^i < 1 , but Ti H Y,j^i T j / { 0 }. Then there 

exists {Ni G 7~-}f =1 such that A^TV^ = 0 and not all 

TV* are zero. But this leads to a contradiction because for i = 
1 

HVII’n h E^ ; w 

j/* 

<Ev^ll^llo) ( 24 ) 

< (E ^A') II'V)-ll(i) (25) 

< max ||V?||*-) (26) 

where we have used equation © for the first inequal- 
ity © Holder’s inequality for second inequality ( [25] ) and 
J2j^i Tij Aj / A^ < 1 for the last inequality. Hence, none of 
{||^ll(i)}?=i * s largest of the set, which is a contradiction. 

□ 

Our next theorem shows an optimality condition of the 
convex program {2^ in terms of its dual solution. 

Theorem A.3 (Lemma 4.2 1331). {Xi}f =1 is the unique 
minimizer of the convex program if there exists a matrix 
Q such that for i = 1 ,..., L, 

1 ) V Ti {Q) = \Ei 

2 ) ||^(Q)ll( i )<A i 

Proof Consider any non-zero perturbation {A i}f =1 to 
{Xi}f =1 such that {Xi + Ai}f =1 stays in the feasible set, that 
is J2i =1 A* = 0. We will show that Yn=i Ai||AQ + > 

We first decompose A z into orthogonal parts with re¬ 
spect to Ti, that is, A i = (A f) + V T ±(Af). We 
also consider a specific subgradient G = [Gi---Gl] t of 
£f=i Aill • ||(i) at {Xi}f =1 such that \\V T , (GOIlfo < A, and 
{V t j- (Ai), V T ± (Gi)) = A||P T j.(A i )|l( i) . Then, from the 


Applying the orthogonal decomposition with respect to Ti and 
using V Ti (Gi ) = V Ti (Q) = XiEi, we have, 


L 


L 


E Aj||X* + Ai|| ( i) > E Ai||Ai|| ( j) + {V T ±(Ai) ,V T ±(Gi)} 

2=1 2=1 

-(^(a i),r T ±m 


Using Holder’s inequality and the assumption for the subgra¬ 
dient Gi, we obtain, 

L L 

EAipQ + Ai|| (i) > E Aill Aill (i) +Ai||iP^(A i )||(i ) 
-||^(Q)ll(i)ll^(Ai)||(i) 

L 

>EAi||Ai|| ( i) 


□ 


prove Theorem V.l 


With Proposition A.2 and Theorem A.3 we are ready to 


Proof of Theorem \V.J 
Proposition 


A.2 


_ Since Tij^j/\ < V 2 > b Y 

Ti H X ^2 Tj = {0} for all i. Thus, there is 

a unique matrix Q in Jf i=1 Ti such that VtXQ) = A iEi. In 
addition, Q can be uniquely expressed as a sum of elements 
in Ti. That is, Q = J2i=i Qi wbb Qi C Ti. We now have 
a matrix Q that satisfies the first optimality condition. In 
the following, we will show that it also satisfies the second 


optimality condition \\Vj 


1 ( 2 ) 


< A*. 


If the vector spaces {Ti}f =1 are orthogonal, then Qi is 
exactly A iEi. Because they are not necessarily orthogonal, we 
express Qi as \Ei plus a correction term AThat is, we 
express Qi = A i(Ei +ef). Putting Qf s back to Q, we have 


Q — A i(Ei + ef) 


(27) 


2=1 


Combining the above equation ( [27] ) with the first optimality 
condition ( |A.3| ), VtXQ) = A iEi, we have J2f=i ^j'TrAEj + 
tj) = A iEi. Since VrXEi + tf) — Ei T ti, rearranging the 
equation, we obtain the following recursive expression for tf. 


A, 


ti = -'PT i Er^' + ^ 


(28) 


v 
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We now obtain a bound on \\'P T ±(Q)\\* i ^ in terms of 

11^(3)11(0 = \\V T ±(52\i{E j + e J ))||( i) 

j/* 

<PE A i(^+^)ii ? 0 

i/* 

< Ef<0^(1 + ll e ill(i)) 


(29) 

(30) 


< + Iloilo-)) ( 31 ) 


where we obtain equation ( |29| ) from equation ( |20| ), equation 
( [30] ) from equation © and the last inequality © from 
Holder’s inequality. 

Similarly, we obtain a recursive expression for 1 + ||e^||*^ 
using equation ([28} 


A,- 


1 +ii e di*i) — 1 +n^Tj(y^ + e . 


■j))w (») 


a 7 - 


A 1 + II E ■f‘(- E A + e 


J9II(0 




A 1 + E + ll e i ll(j)) 


(32) 

(33) 




< i + (E^ir) + Iloilo)) (34) 




where we obtain equation ( [32] ) from equation ( fl9| ), equation 
© from equation © and the last inequality ( [34] ) from 
Holder’s inequality. 

Taking the maximum over i on both sides and rearranging, 
we have 

mfx(l + IMI(i)) < --^ 

* 1 - maxj Z&i nu 

Putting the bound back to equation (f3l~|> , we obtain 


PV(Q)ll( 0 <A. 


Mij Ai 


1 ( 35 ) 


< A,- 


where we used /Ai < 1/2 in the last inequality. 

Thus, we have construct ed a d ual certificate Q that satisfies 
the optimality conditions \A3\ and {AQ}^ is the unique 
optimizer of the convex problem ([ 2 ]). 

□ 


Appendix B 
Proof of Theorem [VI] 


In this section, we provide a proof of Theorem VI showing 
that as long as we can choose our regularization parameters 
accordingly, we obtain a solution from the convex program © 
that is close to the ground truth {Xi}f =1 . 

We will begin by proving a technical lemma collecting three 
inequalities. Throughout the section, we will assume Xz is 
non-zero for simplicity, so that the subgradient of ||X^||f ro is 
exactly X z /\\X z \\ fm . 


Lemma B.l. For i = 1,... 5 L, the following three inequalities 
hold, 

11**11(0 - ll*i + Aillw < ||7V l (A i )|| (i) - ||'P T j-(A,)|| ( j ) 

(36) 

L L 

EAIIPt^AOIIw ^EAllPT^AOIto) (37) 

i=1 i=1 

II^t^AOIIw < 1% E T ’f<llA*||fr 0 (38) 

V h ^ p i 


Proof. We will prove the inequalities in order. 

Let us choose a subgradient Gi = Ei + Wi of ||Ay||^) at 
Xi such that (Wi , V T ± (A*)) = \\'P T ±(A i )\\( i y Then, from 
the definition of the subgradient, we have, 


II Xi + AiU^) > II^H^) + {Gi , A*) 

= ||Aj|| (i) + {Ei,V Ti ( A*)) + || ^(AOIIw 

> 11**11(0 - l|7 ? T i (A i )||( i ) + ||^ T x(A j )||(o 

(39) 


where we used Holder’s inequality for the last inequality ( f39| >. 
Re-arranging, we obtain the first result < [36l >. 

For the second inequality, we note that since 1 A* + 
Aj + Xz + A 2 = Y, we have A z = — XAi A*- From the 
definition of subgradient, we obtain, 


A* ||*2 + A^ 11fro > A^H-X’sllfro + As( 


X, 


\\Xz\\ 


fro 


, A z ) 


= A^s l\Xz ||f ro — E Az( 


As 


i= 1 
L 


II As || 


fro 


,A i) 


> As ||A_s || fr0 ~ E Az ||jrJ|||/ H A *Ho) 




(40) 


>A 2 ||A 2 || fro -^||A i || (i) (41) 

> \ z \\X z \\fto ~ E y ll^’r i (A i )||( i ) 

i=1 

-|||P^(A i )|| (i) (42) 


where we obtain equation ( |4Q| ) from Holder’s inequality, 
equation © from the condition of A* ^ and equation ( [42] ) 
from the triangle inequality. 

Since {Xi + A i}f =1 and X Z + X Z achieves the minimum 
objective function, we have, 


L 

As||A 2 || fr0 + y>||X;|| (i) 

i=1 


L 

> AzllXz + A^ 11 fro + E + A *ll(0 
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Substituting equation ( [39] ) and ( |42| ), we obtain, 

L 

XzWXzU + ^iWXihi) 

2 = 1 

> \z\\X Z \\fto ~ 22 y ll^( A i)ll(i) - y II^T A (Ai)||(i) 

(43) 

L 

+ 22 A *H^HW “ A i||7'V i (Ai)||(i) + A i|I^T i J -(A*)||(j) 

i=l 

Cancelling and re-arranging, we obtain the desired inequal- 
ity 1(37} , 

L L 

^Aill^^AOHw (Ai)|| (i) 

2=1 2=1 

For the third inequality, recall that for any rank -r matrix 
X , its nuclear norm ||X|| nuc is upper bounded by v^ll^lko- 
Moreover, the projection of any matrix Y to the column and 
row space T of a rank r matrix is at most rank-2 r, that is 
rank (Vt(Y)) < 2r. Hence, we obtain, 

||P Ti (A i )|| (i) = 22 ||i? 6 (PT i (A i ))||„uc 
be Pi 

< ^ V / 2 r 6 ||^ 6 (Ai)||f ro 
be Pi 

— , 2 rb||Aj||fro 

V bePi 

where the last inequality follows from Cauchy-Schwatz in¬ 
equality and the fact that ]T 6eP . ||ii&(A i )||g 0 = 1^11^. □ 


with \\X Z + A^llfto + ll^zllfro. Then, using (a + b) (a - 6 ) = 
a 2 — b 2 , we expand the left hand side as: 

L.H.S. = \\Xz + A^llfro — ll-Xzllfio 
= IIA^IIfto + 2(Xz,X z ) 

Recall that A z — ~ Ym=\ we obtain the following lower 
bound for the left hand side: 

L 

L.H.S. = ||A^||f ro — 2(X Z , 

>||A 2 ||| 0 - 2 ^||X 2 ||^ ) ||A i || (i) (45) 

2 = 1 

>||A 2 ||| 0 -S^yj Ai || Ai || (i) (46) 

> HA.zllfto - - *J lfto (A r + A t j.) (47) 

where we used Holder’s inequality for equation <E3>, the 
condition for A^ for equation ( [46]) , and the triangle inequality 
for ( |47| ). 

We now turn to upper bound the right hand side. We know 
||Xz + A^Hfro < 11 Xz 11fro + (At — At-l )/^z from equation 
([44]). Hence, we obtain, 

R.H.S. = (\\X Z + A^Hfro + IIX 2 || fro )2 -(A t - A r ^) 

< ( 2 ||||fro + t—(A t — A t a))-—(A T — A t a) 


With these three inequalities, we now proceed to prove 
Theorem IVI1 

Proof of Theorem \VI\ From the optimality of {Xi + A^}|L 1 , 
we have the following inequality, 

L 

^z\\Xz + A^llfo, + + A^||^) 

2=1 

L 

< Az||^|| fro + y]A,||A,|| ( , ) 

2=1 

Re-arranging, we obtain, 


|| Xz + A^Hfro — 11 Xz 11fro 
<T^ A .(|| A .||(.)-|| A . + A i ||( i) ) 


For convenience, let us define At = J2i =1 Wl'PTii^i 


and A t a — J2i=i 
equation ([36]), we obtain, 


(A, 


Then, from Lemma 


m 


B.l 


||Xz + A^llfro — 11 Xz 11 fro < t—(A p — A T ±) (44) 

Az 


We would like to keep only A z on the left hand side and 
cancel Xz - To do this, we multiply both sides of equation <|44} 


Using Lemma [BTT| equation ([37}, we have, 

R.H.S. < 2 J1^I^ (A t - A t x) + 4"( a t - At-l) 2 
A z A z 

<2^f^(A T -A T ,) + wfA 2 T (48) 

Combining and simplifying the lower bound ( [47] ) and the 
upper bound ( [48] ), we obtain, 

IIA 2 IIL < 3^^At + ieL A 2, (49) 

We will now lower bound || ||f ro = || Y,2i A,:||f ro by 
individual terms: 

11 22 a*ii&o = 22 iiA*iifr 0 + (a* , 22 Aj) 

2=1 2=1 J #2 

^EllA.llL-MwEllAillw 

2=1 j /2 

where we used Holder’s inequality for the last inequality. 
Now, using the assumption that both ||Xj)|^ and ||Xj + 
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Aj ||*^ are bounded by Oiij. We have, 

||^A l ||L>Ell A ^lfro-||A i |lwE 2 ^ 

i=l j^i 

>EHA i ||L-A i ||A i || (i) 


i=1 


i =1 
L 


> E IIAill&o - At - A; 


T A 


i=l 

L 


>El|A i ||| 0 -4A I 


(50) 


(51) 


i=l 


where we used the triangle inequality for equation ( [50] ) and 
Lemma [B~T| equation ( [37] ) for equation ( [51] ). 

Substituting the lower bound back to equation ( |49| ), we have 

£ II A.IIL < (3&!^ + 4)A T + 16 4-4 (52) 


i=l 


’*% T 


We now turn to upper bound the equation. From 
Lemma EL equatio n ( [38] ), we know that At < 
i Aza /2 ^bePi 11 fr 0 . Hence, we have, 


i 11 fro 


16A£<16 E A M 2 L r6 HA 

\i=l Y 

^ 16 fe^E^l Lii A 

\*=i bePi 

— ^2 ll^llfro 


i 11 fro 


i= 1 


(53) 


(54) 


i= 1 


where we used Cauchy-Schwartz’s inequality for equation ( [53] ) 
and the condition for A^: for equation 0 Hence, substituting 
back to equation ( [52] ), rearranging and ignoring constants, we 
have, 


£» A 

i= 1 


? < 


ill fro 


11*2 II 


fro 


As 


E\/E^iia 

be Pi 


i 11 fro 


i= 1 


Completing the squares with respect to ||A^||f ro gives us, 

11*2 11 fro 


E 


I fro - 


i= 1 


A i 


< 




2 


Z— 1 \ 




Using the triangle inequality to lower bound the left hand 
side, we obtain 

L 


£ll A 

i=1 


i 11 fro 


11 *2 11 fro 

A^ 


S/E r o 

V bePi 


< 




ii* 


2 fro 




a n/E^ 




Using the fact that ^1-norm is larger than the ^2-norm, and 
re-arranging give us the desired result, 


EUAillfro 




< 


ll^llfro 

A^ 




i= 1 


bePi 


n 


□ 
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